Istio causing problems with dispatching
# spicedb
w
Hello! I've been running SpiceDB with dispatch disabled for a while now. I've tried re-enabling it, and hit the same problem as I did in the past: it's far worse than having it disabled! See these screenshots (dispatch enabled at 11:57, rolled back at 12:07):
- Latency goes through the roof, at all percentiles. E.g. the p50 is ~6x worse, the p99 ~12x worse (ignoring the massive spike that happened at 12:00, which is a problem in itself).
- Availability goes way down. I'm seeing logs (3rd and 4th screenshots) that are probably related. I'm not getting that many of these logs (~500 in the period where dispatch was enabled).
- Cache hit rate doesn't improve at all, it's still ~30%. I was expecting the dispatch cluster to mean better caching?

I see 3 possible scenarios here:
- Dispatch just doesn't work
- Dispatch doesn't work for my particular workload
- I'm doing something wrong

Note: it's nothing new, dispatch has never worked for me since at least 1.13.0. What do you think, any pointers?
https://cdn.discordapp.com/attachments/844600078948630559/1238452244186402876/image.png?ex=663f5608&is=663e0488&hm=b31ecbc585fbc9baa34f83f4fb5e786d1cbe51680e1a0ffaa1bdf96651a8b00f&
https://cdn.discordapp.com/attachments/844600078948630559/1238452244454834186/image.png?ex=663f5608&is=663e0488&hm=ef5c32a3a999affada4bacaf91279e2936d4864a60b456376d84fd00af42566d&
https://cdn.discordapp.com/attachments/844600078948630559/1238452244769411092/image.png?ex=663f5608&is=663e0488&hm=5cf3fc8ca626f8c68c257da2e8a2b958edfa90e25de6694efb5994f19a4d5b64&
https://cdn.discordapp.com/attachments/844600078948630559/1238452245100499056/image.png?ex=663f5608&is=663e0488&hm=27bf486675cb6f0c39a30f60f2feefa5ea205ce3b82f465b660050d9566b6b48&
https://cdn.discordapp.com/attachments/844600078948630559/1238452245411135488/image.png?ex=663f5608&is=663e0488&hm=5ea2fd135cac219ee24895ff6f4ea29be6915a1d99358f364973875542a5761a&
v
We've never run without dispatch, and all our managed offerings run with dispatch enabled, without errors or latency impact when the clusters roll. That does not look right to me. I'd probably use OpenTelemetry to trace the requests and see where the time is being spent. I'd suspect something is up with your Kubernetes setup / networking. What does your setup look like? Are you using Istio? Sidecars?
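For example, here's a minimal client-side tracing sketch in Go using the OpenTelemetry SDK; the collector endpoint and `doCheck` are placeholders (assumptions, not your actual setup) for your OTLP collector and your real SpiceDB call. SpiceDB itself can also export spans (check `spicedb serve --help` for its `--otel-provider` / `--otel-endpoint` flags):

```go
package main

import (
	"context"
	"log"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// Ship spans to an OTLP collector; "otel-collector:4317" is a placeholder.
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("otel-collector:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatal(err)
	}
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	defer tp.Shutdown(ctx)
	otel.SetTracerProvider(tp)

	// Wrap each SpiceDB call in a span so you can see where the time goes
	// (client -> sidecar -> SpiceDB -> dispatch hop).
	tracer := otel.Tracer("spicedb-client")
	ctx, span := tracer.Start(ctx, "CheckPermission")
	doCheck(ctx) // placeholder for your actual CheckPermission call
	span.End()
}

// doCheck stands in for a real SpiceDB request.
func doCheck(ctx context.Context) { time.Sleep(10 * time.Millisecond) }
```

With client spans and SpiceDB's own spans in the same backend, you can see whether the extra latency lives in the sidecar hop or inside dispatch itself.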
w
We do use Istio (via sidecars). It's not my domain of expertise, but I've indeed learned to look at Istio with suspicion every time something goes wrong
Lemme see if I can get OTel working again
For my understanding, what sort of cache hit rate do you see? Is it highly dependent on data/workload or does it tend to be pretty consistent?
v
We've had reports from folks having issues with Istio, and removing it fixed them. Istio does not make a lot of sense here, since dispatch is an internal API to SpiceDB: each SpiceDB version knows how to talk to its peers, and it's not meant to be used like a public API. Unless you have a specific networking setup, I don't see the value in running it in front of dispatch.
My guess is that it's related to Istio; it adds overhead in the request path. I'd suggest trying without it.
Cache hit ratio is highly dependent on your workload. And again, we want to stress that SpiceDB's caching mechanism is hot-spot caching: it's a mechanism to dedupe requests
20-30% is in line with the sort of cache hit rates we see.
w
The Istio thing makes sense. I'll ask my infra team for help with trying to remove it
v
Is your setup perhaps crossing Kubernetes-cluster boundaries with Istio? Like cross-datacenter?
w
It's all in the same cluster. Pods could be in different AZs though
v
That's OK, networking overhead between AZs should be in the realm of sub-millisecond latency in most cloud providers
w
Given I'm already at a ~30% cache hit rate, what benefits should I expect from a working dispatch cluster?
v
Horizontal scalability and better usage of your database
You are basically looking at reducing your db usage by up to a factor of your node count in the worst case (3x with your 3 pods)
w
It's very much what I'm after, but I'm not really understanding how, if we don't expect the cache hit rate to go up 🤔
v
Without dispatch, your requests are load balanced across 3 independent caches. If you have to solve subproblem A, that subproblem has to be solved by each SpiceDB node separately, which triples the number of times it needs to access the database.
The caches are both client-side and server-side, at both ends of the dispatch ring. So effectively you end up with caches populated the same way either way; the difference is the amount of work you had to do to populate them.
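Roughly the idea, as a hypothetical Go sketch; SpiceDB actually uses a consistent hashring plus caches at both ends of the hop, so this only shows the shape of the routing:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// owner picks the node responsible for a subproblem key. Hypothetical
// simplification: SpiceDB uses a consistent hashring, but the effect is
// the same: every node agrees on a single owner per key.
func owner(key string, nodes []string) string {
	h := fnv.New32a()
	h.Write([]byte(key))
	return nodes[h.Sum32()%uint32(len(nodes))]
}

func main() {
	nodes := []string{"spicedb-0", "spicedb-1", "spicedb-2"}

	// Without dispatch, whichever node happens to receive a subproblem
	// computes it, so all three caches pay to learn the same answer.
	// With dispatch, every node forwards it to the owner, so it is
	// computed once and then cached on both sides of that hop.
	for _, sub := range []string{"doc:1#viewer", "doc:2#viewer", "group:eng#member"} {
		fmt.Printf("%s -> %s\n", sub, owner(sub, nodes))
	}
}
```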
w
I'll have to ponder over that but sounds reasonable! Thank you
j
it also allows singleflight to truly work
rather than, if you're running three pods, it becoming "3-flight"
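here's a rough sketch of what singleflight does within a single process, using Go's golang.org/x/sync/singleflight (that SpiceDB uses this exact package internally is an assumption). Dispatch is what extends the dedup across pods, by routing a given key to one pod instead of three:

```go
package main

import (
	"fmt"
	"sync"
	"time"

	"golang.org/x/sync/singleflight"
)

func main() {
	var g singleflight.Group
	var wg sync.WaitGroup
	calls := 0

	// Ten concurrent requests for the same subproblem collapse into one
	// execution; the other nine callers just wait and share its result.
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			g.Do("subproblem-A", func() (interface{}, error) {
				calls++ // safe: executions for a given key never overlap
				time.Sleep(50 * time.Millisecond) // pretend database work
				return "allowed", nil
			})
		}()
	}
	wg.Wait()
	fmt.Println("underlying executions:", calls) // typically 1, not 10
}
```

with three pods and no dispatch, each pod runs its own flight for the same key, hence "3-flight"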
y
IIRC Istio also behaves as an internal load balancer, and that could be causing problems
j
yeah, we've had multiple reports of problems with dispatching and sidecars like istio
w
We've disabled Istio, seems like the problem is indeed gone 👍 Cache hit rate went up to ~50% and database CPU load is reduced by almost half 🙌 Thank you for the pointer!
j
perfect
v
nice!